[MRG+1] Include example for signal docs #1566
Conversation
```diff
@@ -66,7 +66,7 @@ Disabling an extension

 In order to disable an extension that comes enabled by default (ie. those
 included in the :setting:`EXTENSIONS_BASE` setting) you must set its order to
-``None``. For example::
+``None``(or ``0``). For example::
```
hey @darshanime! Setting `0` won't disable an extension. Compare how all default extensions have a value of `0`, and these are all enabled. ;)
@jdemaeyer, I got tipped off track by this. In the extensions example, the `getbool` method is used to check if the extension is enabled.
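The check in question looks roughly like this (only a sketch of the pattern from the extensions docs; the extension class name here is made up):

```python
from scrapy.exceptions import NotConfigured


class MyExtension(object):
    """Hypothetical extension, sketched only to show the getbool() check."""

    @classmethod
    def from_crawler(cls, crawler):
        # A boolean setting such as MYEXT_ENABLED is read with getbool();
        # raising NotConfigured is how an extension opts out when disabled.
        if not crawler.settings.getbool('MYEXT_ENABLED'):
            raise NotConfigured
        return cls()
```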
Oh, I see. In that example, the `MYEXT_ENABLED` setting is a boolean (like many similar builtin settings, e.g. `LOG_ENABLED` or `DNSCACHE_ENABLED`). `EXTENSIONS`, on the other hand, is quite different, in that it is a dictionary.
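Sketched side by side in a project's `settings.py` (the extension path below is just an illustration using one of the default extensions, not something this PR touches):

```python
# A boolean setting toggles a single feature and is read with getbool():
MYEXT_ENABLED = False

# EXTENSIONS is a dictionary mapping extension paths to orders; assigning
# None to an entry is what disables an extension that is enabled by default.
EXTENSIONS = {
    'scrapy.extensions.telnet.TelnetConsole': None,
}
```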
Right, will make the changes. Thanks!
Hey @darshanime, thanks for taking the time to help with the docs. This is a very long example with quite a bit of boilerplate (setting up the Crawler and CrawlerProcess), maybe too much for just showing how to connect to a signal. Perhaps it could be sufficient to assume that the …
Hm, you are right. Maybe I overdid it.

```python
from scrapy import signals

def engine_started():
    print "###Engine started###"

crawler.signals.connect(engine_started, signal = signals.engine_started)
```
Well, starting a crawl via CrawlerProcess is not the most common way to run Scrapy spiders. Usually people use Scrapy projects and …
Hmm, yeah, maybe the …
Yeah, I think that'll work for a CrawlerProcess example. But it is also unclear where to get the CrawlerProcess object from when `scrapy crawl` is used (see #1226). So there should also be an example with …
Thank you for the comments and pardon me for the delay.

```python
from scrapy.crawler import Crawler, CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy import signals
from dirbot.spiders import dmoz

# Initiating a crawler object with the dmoz spider
crawler = Crawler(dmoz.DmozSpider, get_project_settings())

# Connecting callback functions to signals
def from_crawler(crawler):
    crawler.signals.connect(item_scraped, signal = signals.item_scraped)

def item_scraped(item, spider):
    print "##Item scraped by %s##" %spider.name

def main():
    from_crawler(crawler)
    process = CrawlerProcess()
    process.crawl(crawler)
    process.start()

if __name__ == '__main__':
    main()
```
Hey @darshanime. Thanks for working on it! I think the example is not perfect because it won't work if a spider is started using …
Hi @kmike! I cooked this up, but the function …

```python
class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        crawler.signals.connect(cls.item_scraped, signal = signals.item_scraped)
        return super(DmozSpider, cls).from_crawler(crawler, *args, **kwargs)

    def item_scraped(self, item, spider):
        print "##Item scraped by %s##" %spider.name

    def parse(self, response):
        # rest of the spider
```
I think the problem is that you're connecting an unbound `DmozSpider.item_scraped` method as a signal handler. If you think about it, how would Python know what to put in `self`? Try this:

```python
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super(DmozSpider, cls).from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.item_scraped, signal=signals.item_scraped)
    return spider
```
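As a usage sketch (the driver script below is an illustration, not part of the PR): with the handler connected inside `from_crawler`, the same spider works both under `scrapy crawl dmoz` and when driven from a small script, because the signal is hooked up whenever the spider is built from a crawler:

```python
# run_dmoz.py -- hypothetical driver script for the dirbot example project
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from dirbot.spiders import dmoz

process = CrawlerProcess(get_project_settings())
# from_crawler() is invoked for the spider class here, which is where the
# item_scraped handler gets connected
process.crawl(dmoz.DmozSpider)
process.start()
```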
Yes, indeed! This works now.
I like the examples, good work @darshanime! One small thing ;) We try to comply with PEP8 (which is the standard style guide for Python): the indentation level (e.g. after …). You can see at the bottom of this PR that there are failing checks. Ignore the codecov one (it makes sure that code you put into Scrapy's source is tested through a test in …). You can check whether the test passes at home by running …
Sure, I'll clean the commit history and get the PR in shape.

I wouldn't touch the original dmoz spider. Using signals is a bit too much for a "write your first scrapy project" tutorial.
(force-pushed from 3209770 to 7c87c3e)
Commit history rewritten.
Current coverage is 83.47% (diff: 100%)
```python
def item_scraped(self, item, spider):
    print "##Item scraped by %s##" %spider.name
```
An easier way to process items is to use item pipelines. I think it is better to show usage of another signal; `spider_closed` is a good candidate.
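For context, the item-pipeline route mentioned above would look roughly like this (a sketch; the class name is made up, and it would be enabled through the `ITEM_PIPELINES` setting):

```python
class LogItemsPipeline(object):
    """Hypothetical pipeline showing the usual way to post-process items."""

    def process_item(self, item, spider):
        # pipelines receive every scraped item; returning it passes it on
        spider.logger.info("Item scraped by %s: %r", spider.name, item)
        return item
```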
Should I add `spider_closed` along with the current `item_scraped`, or remove `item_scraped`?
Any reason why this PR wasn't merged? I was looking for ways to contribute and found this. There isn't an example for custom signals in the docs either.
@mgachhui, looks like this conversation stalled after @darshanime's question. I would comment on the question by agreeing with @kmike. I had to cook up a similar example the other day on IRC; having an example in the docs of how to properly and idiomatically hook a `spider_closed` handler makes a lot of sense.
I will submit an update commit soon, thanks for the review!
(force-pushed from 3cea35b to 10477a6)
@redapple can we merge this?
```python
def spider_closed(self, spider):
    print "##Spider closed: %s##" %spider.name
```
Could you modify this print statement to use logging instead? e.g. `spider.logger.info`. Also, `print` like this is not Python 3 compatible.
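In other words, something along these lines (a sketch of the requested change):

```python
def spider_closed(self, spider):
    # use the spider's logger instead of a bare print statement;
    # this also works unchanged on Python 3
    spider.logger.info('Spider closed: %s', spider.name)
```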
(force-pushed from 10477a6 to 6e390e3)
@redapple, kindly check now!

LGTM
```
::

    from scrapy import signals
```
Hey, @darshanime -- this looks good, only one small thing to adjust before merging: it would be nice to add `from scrapy import Spider` here and a `pass` to the parse method below, so that users don't get an error when trying to run it as-is (a quick way to check is to paste the code into a file and run it with `scrapy runspider file.py`). Ready to merge after that. :)
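Putting those suggestions together, the documented snippet would look roughly like this (a sketch of what the review describes, not necessarily the exact text that was merged):

```python
from scrapy import signals
from scrapy import Spider


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    ]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(DmozSpider, cls).from_crawler(crawler, *args, **kwargs)
        # connect the bound method of the spider instance, not the class
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        spider.logger.info('Spider closed: %s', spider.name)

    def parse(self, response):
        pass
```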
nice idea! fixed in the amended commit.
(force-pushed from 6e390e3 to a2e6452)
Thanks @darshanime!
The signal docs did not have any examples showing how to connect callback functions to signals. This PR adds such an example (and fixes a few typos in the documentation).
Fixes #1521