
[MRG+1] Include example for signal docs #1566

Merged: 1 commit merged into scrapy:master from darshanime:signal-example on Jul 25, 2016

Conversation

@darshanime (Contributor):

The signal docs did not have any examples showing how to connect callback functions to signals. This PR adds an example (and fixes a few typos in the documentation).

Fixes #1521

@@ -66,7 +66,7 @@ Disabling an extension

 In order to disable an extension that comes enabled by default (ie. those
 included in the :setting:`EXTENSIONS_BASE` setting) you must set its order to
-``None``. For example::
+``None``(or ``0``). For example::
@jdemaeyer (Contributor):

Hey @darshanime! Setting 0 won't disable an extension. Compare how all the default extensions have an order value of 0, and they are all enabled. ;)
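
For reference, a minimal settings sketch of that distinction (the extension paths below are Scrapy defaults; the sketch is illustrative, not part of this PR):

# settings.py sketch: None disables a default extension, while an
# integer (including 0) is merely its order value.
EXTENSIONS = {
    'scrapy.extensions.telnet.TelnetConsole': None,  # disabled
    'scrapy.extensions.corestats.CoreStats': 0,      # still enabled, order 0
}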

@darshanime (Contributor, Author):

@jdemaeyer, I got thrown off track by this. In the extensions example, the getbool method is used to check whether the extension is enabled.

@jdemaeyer (Contributor):

Oh, I see. In that example, the MYEXT_ENABLED setting is a boolean (like many similar builtin settings, e.g. LOG_ENABLED or DNSCACHE_ENABLED). EXTENSIONS on the other hand is quite different, in that it is a dictionary.
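
For reference, the boolean-setting pattern from that extensions example looks roughly like this sketch (MYEXT_ENABLED and MyExtension are the hypothetical names used there; illustrative only, not part of the PR):

from scrapy.exceptions import NotConfigured

class MyExtension(object):
    @classmethod
    def from_crawler(cls, crawler):
        # getbool() reads a boolean setting; this is unrelated to the
        # order values stored in the EXTENSIONS dict.
        if not crawler.settings.getbool('MYEXT_ENABLED'):
            raise NotConfigured
        return cls()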

@darshanime (Contributor, Author):

Right, will make the changes. Thanks!

@jdemaeyer (Contributor):

Hey @darshanime, thanks for taking the time to help with the docs.

This is a very long example with quite a bit of boilerplate (setting up the Crawler and CrawlerProcess), maybe too much for just showing how to connect to a signal. Perhaps it could be sufficient to assume that the crawler object already exists and then connect a single callback? Just my two cents though ;)

@darshanime (Contributor, Author):

Hm, you are right. Maybe I overdid it.
How about just a simple:

from scrapy import signals

def engine_started():
    print "###Engine started###"

crawler.signals.connect(engine_started, signal = signals.engine_started)

@kmike (Member) commented Oct 29, 2015:

Well, starting a crawl via CrawlerProcess is not the most common way to run Scrapy spiders. Usually people use Scrapy projects and the scrapy crawl command; in that case it is not clear where to get the crawler object from. I think an example should show it.

@jdemaeyer (Contributor):

Hmm, yeah, maybe the crawler should indeed not simply fall from the sky. Perhaps an example including the CrawlerProcess boilerplate but with just a single signal handler (preferably one that has a callback signature, e.g. item_scraped)?

@kmike (Member) commented Oct 29, 2015:

Yeah, I think that'll work for a CrawlerProcess example.

But it is also unclear where to get the CrawlerProcess object from when scrapy crawl is used (see #1226).

So there should also be an example using the .from_crawler method, since it is the most common way to get a crawler object.

@darshanime (Contributor, Author):

Thank you for the comments and pardon me for the delay.
How about this?

from scrapy.crawler import Crawler, CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy import signals

from dirbot.spiders import dmoz

#Initiating a crawler object with the dmoz spider
crawler = Crawler(dmoz.DmozSpider, get_project_settings())

#Connecting callback functions to signals
def from_crawler(crawler):
  crawler.signals.connect(item_scraped, signal = signals.item_scraped)

def item_scraped(item, spider):
    print "##Item scraped by %s##" %spider.name

def main():
  from_crawler(crawler)
  process = CrawlerProcess()
  process.crawl(crawler)
  process.start()

if __name__ == '__main__':
    main()

@kmike (Member) commented Nov 2, 2015:

Hey @darshanime. Thanks for working on it!

I think the example is not perfect because it won't work if a spider is started using scrapy crawl or scrapy runspider. Check http://doc.scrapy.org/en/latest/intro/overview.html - this is how a spider is usually defined & executed. To connect a signal without writing a custom start script you have to override the spider's from_crawler method. Another way is to connect a signal in the start_requests method, but some signals will already have fired by the time start_requests is called. Heh, this shows how bad our signal docs are - it is hard even for a dedicated person to figure this out :)
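
For illustration, the start_requests alternative mentioned above might look like this sketch (spider and handler names are hypothetical; note the caveat that some signals have already fired by the time start_requests runs):

from scrapy import signals, Spider

class MySpider(Spider):
    name = 'myspider'
    start_urls = ['http://www.dmoz.org/']

    def start_requests(self):
        # self.crawler is available once the spider has been created
        # via from_crawler
        self.crawler.signals.connect(self.item_scraped,
                                     signal=signals.item_scraped)
        return super(MySpider, self).start_requests()

    def item_scraped(self, item, spider):
        spider.logger.info('Item scraped by %s', spider.name)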

@darshanime (Contributor, Author):

Hi @kmike!

I cooked this up, but the item_scraped function is not firing. Can you help me figure this out? I am still learning Python :/
(This is the dmoz spider.)

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        crawler.signals.connect(cls.item_scraped, signal = signals.item_scraped)
        return super(DmozSpider, cls).from_crawler(crawler, *args, **kwargs)

    def item_scraped(self, item, spider):
        print "##Item scraped by %s##" %spider.name

    def parse(self, response):
         #rest of the spider

@kmike (Member) commented Nov 2, 2015:

I think the problem is that you're connecting the unbound DmozSpider.item_scraped method as a signal handler. If you think about it, how would Python know what to put in the self argument of this method before a spider is created? And what if several spiders of the same class are created?

Try this:

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super(DmozSpider, cls).from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.item_scraped, signal=signals.item_scraped)
    return spider

@darshanime (Contributor, Author):

Yes, indeed! This works now.
Is the example good to go now?

@jdemaeyer (Contributor):

I like the examples, good work @darshanime!

One small thing ;) We try to comply with PEP 8 (the standard style guide for Python): the indentation level (e.g. after def: or if:) should always be four spaces, and there should be no spaces around the = in keyword arguments.

You can see at the bottom of this PR that there are failing checks. Ignore the codecov one (it makes sure that code you put into Scrapy's source is tested through a test in tests/, but since you only worked on the docs it got confused). The travis-ci one, however, should always pass. Here it failed because there was "/home/travis/build/scrapy/scrapy/docs/topics/signals.rst:32: ERROR: Unexpected indentation." For Sphinx (the tool building the docs), you need to announce code blocks with a double colon, e.g. catch signals and perform some action::

You can check whether the test passes at home by running tox docs in the scrapy repository.

@darshanime (Contributor, Author):

Sure, I'll clean the commit history and get the PR in shape.
Also, should we modify the original dmoz spider too, to include the example?

@jdemaeyer (Contributor):

I wouldn't touch the original dmoz spider. Using signals is a bit too much for a "write your first scrapy project" tutorial.

@darshanime (Contributor, Author):

Commit history rewritten.
Pardon the long delay; I had university exams. I hope to contribute more effectively from now on.

@codecov-io commented Nov 16, 2015:

Current coverage is 83.47% (diff: 100%)

Merging #1566 into master will not change coverage

Powered by Codecov. Last update ec1c615...a2e6452



def item_scraped(self, item, spider):
    print "##Item scraped by %s##" %spider.name
@kmike (Member):

An easier way to process items is to use item pipelines. I think it is better to show usage of another signal; spider_closed is a good candidate.
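
A minimal sketch of that item pipeline alternative (the class name is hypothetical; a pipeline is enabled via the ITEM_PIPELINES setting):

class LogItemsPipeline(object):
    def process_item(self, item, spider):
        # called for every item the spider scrapes
        spider.logger.info('Item scraped by %s', spider.name)
        return item  # pass the item on to later pipelines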

@darshanime (Contributor, Author):

Should I add spider_closed along with the current item_scraped, or remove item_scraped?

@redapple added the docs label on Jan 26, 2016
@ghost commented May 8, 2016:

Any reason why this PR wasn't merged? I was looking for ways to contribute and found this. There isn't an example for custom signals in the docs either.

@redapple (Contributor):

@mgachhui, looks like this conversation stalled after @darshanime's question.

To answer that question: I agree with @kmike that spider_closed is a better suited and more common example for using signals in spider classes, so I would remove the item_scraped signal handler from the example.

I had to cook up a similar example the other day on IRC. Having an example in the docs of how to properly and idiomatically hook up a spider_closed handler makes a lot of sense.

@darshanime (Contributor, Author):

I will submit an update commit soon, thanks for the review!


@darshanime force-pushed the signal-example branch 2 times, most recently from 3cea35b to 10477a6, on July 24, 2016
@darshanime (Contributor, Author):

@redapple can we merge this?



def spider_closed(self, spider):
    print "##Spider closed: %s##" %spider.name
@redapple (Contributor):

Could you modify this print statement to use logging instead, e.g. spider.logger.info?
Also, a print statement like this is not Python 3 compatible.

@darshanime (Contributor, Author):

@redapple, kindly check now!

@redapple (Contributor):

LGTM

@redapple changed the title from "Include example for signal docs" to "[MRG+1] Include example for signal docs" on Jul 25, 2016
::

    from scrapy import signals

@eliasdorneles (Member):

Hey, @darshanime -- this looks good, only one small thing to adjust before merging: it would be nice to add from scrapy import Spider here and a pass to the parse method below, so that users don't get an error when trying to run it as-is (a quick way to check is to paste the code into a file and run it with scrapy runspider file.py).

Ready to merge after that. :)

@darshanime (Contributor, Author):

Nice idea! Fixed in the amended commit.
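
With all of the review feedback applied (spider_closed instead of item_scraped, logging via spider.logger, PEP 8 spacing, the Spider import, and a pass in parse), the merged example presumably looks close to this:

from scrapy import signals
from scrapy import Spider

class DmozSpider(Spider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = [
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
    ]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Build the spider first, then connect its bound method as the
        # signal handler (see the unbound-method discussion above).
        spider = super(DmozSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed,
                                signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        spider.logger.info('Spider closed: %s', spider.name)

    def parse(self, response):
        pass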

@eliasdorneles (Member):

Thanks @darshanime!

@eliasdorneles merged commit b6375d3 into scrapy:master on Jul 25, 2016
@darshanime deleted the signal-example branch on July 26, 2016