
[MRG+1] Include example for signal docs #1566

Merged
merged 1 commit into scrapy:master from darshanime:signal-example on Jul 25, 2016

Conversation

@darshanime (Contributor) commented Oct 29, 2015

The signal docs did not have any examples showing how to connect callback functions to signals. This PR adds one (and also fixes a few typos in the documentation).

Fixes #1521

@@ -66,7 +66,7 @@ Disabling an extension

In order to disable an extension that comes enabled by default (ie. those
included in the :setting:`EXTENSIONS_BASE` setting) you must set its order to
- ``None``. For example::
+ ``None``(or ``0``). For example::


@jdemaeyer (Contributor), Oct 29, 2015:

Hey @darshanime! Setting 0 won't disable an extension. Compare: all default extensions have a value of 0, and they are all enabled. ;)


@darshanime (Author), Oct 29, 2015:

@jdemaeyer, I got thrown off track by this: in the extensions example, the getbool method is used to check whether the extension is enabled.


@jdemaeyer (Contributor), Oct 29, 2015:

Oh, I see. In that example, the MYEXT_ENABLED setting is a boolean (like many similar builtin settings, e.g. LOG_ENABLED or DNSCACHE_ENABLED). EXTENSIONS on the other hand is quite different, in that it is a dictionary.
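The distinction can be sketched in plain Python, with no Scrapy needed (the extension names and order values below are illustrative, not taken from a real project):

```python
# Two different on/off conventions: a boolean flag setting (like
# MYEXT_ENABLED or LOG_ENABLED), versus the EXTENSIONS order dict,
# where only None (not 0) marks an extension as disabled.
settings = {"MYEXT_ENABLED": True}  # simple boolean switch

extensions = {
    "myproject.extensions.MyExtension": 500,         # enabled, order 500
    "scrapy.extensions.corestats.CoreStats": 0,      # enabled, order 0
    "scrapy.extensions.telnet.TelnetConsole": None,  # disabled
}

# An order of 0 is still enabled; only None disables.
enabled = sorted(name for name, order in extensions.items() if order is not None)
print(enabled)
```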


@darshanime (Author), Oct 29, 2015:

Right, will make the changes. Thanks!

print "###Engine started###"


def spider_opened():


@jdemaeyer (Contributor), Oct 29, 2015:

Don't forget the signature here (def spider_opened(spider, reason):), also for all following callbacks.

@jdemaeyer (Contributor) commented Oct 29, 2015

Hey @darshanime, thanks for taking the time to help with the docs.

This is a very long example with quite a bit of boilerplate (setting up the Crawler and CrawlerProcess), maybe too much for just showing how to connect to a signal. Perhaps it could be sufficient to assume that the crawler object already exists and then connect a single callback? Just my two cents though ;)

@darshanime (Author) commented Oct 29, 2015

Hm, you are right. Maybe I overdid it.
How about just something simple:

from scrapy import signals

def engine_started():
    print "###Engine started###"

crawler.signals.connect(engine_started, signal = signals.engine_started)
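As an aside, what connect does can be pictured with a toy dispatcher. This is only a sketch of the idea, not Scrapy's actual SignalManager (the real one is considerably more elaborate, with error catching and deferred support):

```python
from collections import defaultdict

class TinySignalManager:
    """Minimal stand-in for crawler.signals, for illustration only."""

    def __init__(self):
        self._receivers = defaultdict(list)

    def connect(self, receiver, signal):
        # Register a callback to be run whenever `signal` is sent.
        self._receivers[signal].append(receiver)

    def send(self, signal, **kwargs):
        # Call every registered receiver with the signal's arguments.
        return [receiver(**kwargs) for receiver in self._receivers[signal]]

engine_started = object()  # a signal is just a unique sentinel object

signals = TinySignalManager()
events = []
signals.connect(lambda: events.append("engine started"), signal=engine_started)
signals.send(engine_started)
print(events)  # ['engine started']
```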
@kmike (Member) commented Oct 29, 2015

Well, starting a crawl via CrawlerProcess is not the most common way to run Scrapy spiders. Usually people use Scrapy projects and the scrapy crawl command; in that case it is not clear where to get the crawler object from. I think an example should show that.

@jdemaeyer (Contributor) commented Oct 29, 2015

Hmm, yeah, maybe the crawler should indeed not simply fall from the sky. Perhaps an example including the CrawlerProcess boilerplate but with just a single signal handler (preferably one that has a callback signature, e.g. item_scraped)?

@kmike (Member) commented Oct 29, 2015

Yeah, I think that'll work for a CrawlerProcess example.

But it is also unclear where to get a CrawlerProcess object from when scrapy crawl is used (see #1226).

So there should also be an example using the .from_crawler method, since that is the most common way to get a crawler object.

@darshanime (Author) commented Nov 2, 2015

Thank you for the comments, and pardon me for the delay.
How about this?

from scrapy.crawler import Crawler, CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy import signals

from dirbot.spiders import dmoz

#Initiating a crawler object with the dmoz spider
crawler = Crawler(dmoz.DmozSpider, get_project_settings())

#Connecting callback functions to signals
def from_crawler(crawler):
  crawler.signals.connect(item_scraped, signal = signals.item_scraped)

def item_scraped(item, spider):
    print "##Item scraped by %s##" %spider.name

def main():
  from_crawler(crawler)
  process = CrawlerProcess()
  process.crawl(crawler)
  process.start()

if __name__ == '__main__':
    main()
@kmike (Member) commented Nov 2, 2015

Hey @darshanime. Thanks for working on it!

I think the example is not perfect because it won't work if a spider is started using scrapy crawl or scrapy runspider. Check http://doc.scrapy.org/en/latest/intro/overview.html - this is how a spider is usually defined & executed. To connect a signal without writing a custom start script you have to override the spider's from_crawler method. Another way is to connect a signal in the start_requests method, but some signals will already have been fired by the time start_requests is called. Heh, this shows how bad our signal docs are - it is hard even for a dedicated person to figure this out :)

@darshanime (Author) commented Nov 2, 2015

Hi @kmike!

I cooked this up, but the item_scraped function is not firing. Can you help me figure this out? I am still learning Python :/
(This is the dmoz spider)

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        crawler.signals.connect(cls.item_scraped, signal = signals.item_scraped)
        return super(DmozSpider, cls).from_crawler(crawler, *args, **kwargs)

    def item_scraped(self, item, spider):
        print "##Item scraped by %s##" %spider.name

    def parse(self, response):
         #rest of the spider
@kmike (Member) commented Nov 2, 2015

I think the problem is that you're connecting an unbound DmozSpider.item_scraped method as a signal handler. If you think about it, how would Python know what to put in the self argument of this method before a spider is created? And what if several spiders of the same class are created?

Try this:

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super(DmozSpider, cls).from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.item_scraped, signal=signals.item_scraped)
    return spider
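The bound-vs-unbound distinction can be seen without Scrapy at all (the class and names below are made up for illustration):

```python
class DemoSpider:
    """Stand-in class to illustrate bound vs. unbound methods."""

    def __init__(self, name):
        self.name = name

    def item_scraped(self, item):
        return "##Item scraped by %s##" % self.name

unbound = DemoSpider.item_scraped  # plain function: self must be passed explicitly
spider = DemoSpider("dmoz")
bound = spider.item_scraped        # self is already baked in

assert bound("some item") == "##Item scraped by dmoz##"
assert unbound(spider, "some item") == "##Item scraped by dmoz##"
```

This is why connecting spider.item_scraped (created after the spider instance exists) works, while connecting cls.item_scraped does not.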
@darshanime (Author) commented Nov 2, 2015

Yes, indeed! This works now.
Is the example good to go?

@jdemaeyer (Contributor) commented Nov 2, 2015

I like the examples, good work @darshanime!

One small thing ;) We try to comply with PEP 8 (the standard style guide for Python): the indentation level (e.g. after def: or if:) should always be four spaces, and there should be no spaces around the = in keyword arguments.

You can see at the bottom of this PR that there are failing checks. Ignore the codecov one (it makes sure that code you put into Scrapy's source is tested through a test in tests/, but since you only worked on the docs it got confused). The travis-ci one however should always pass. Here it failed because there was "/home/travis/build/scrapy/scrapy/docs/topics/signals.rst:32: ERROR: Unexpected indentation." For Sphinx (the tool building the docs), you need to announce code blocks with a double colon, e.g. catch signals and perform some action::

You can check whether the test passes at home by running tox docs in the scrapy repository.
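For reference, the double-colon convention jdemaeyer mentions looks like this in reST (the handler name here is illustrative):

```rst
You can catch signals and perform some action::

    crawler.signals.connect(spider_closed, signal=signals.spider_closed)
```

The ``::`` at the end of the introductory line tells Sphinx that an indented literal block follows.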

@darshanime (Author) commented Nov 2, 2015

Sure, I'll clean up the commit history and get the PR in shape.
Also, do we modify the original dmoz spider to include the example too?

@jdemaeyer (Contributor) commented Nov 2, 2015

I wouldn't touch the original dmoz spider. Using signals is a bit too much for a "write your first scrapy project" tutorial.

@darshanime darshanime force-pushed the darshanime:signal-example branch from 3209770 to 7c87c3e Nov 16, 2015
@darshanime (Author) commented Nov 16, 2015

Commit history rewritten.
Pardon the long delay; I had university exams. I hope to contribute more effectively from now on.

@codecov-io commented Nov 16, 2015

Current coverage is 83.47% (diff: 100%)

Merging #1566 into master will not change coverage

Powered by Codecov. Last update ec1c615...a2e6452



def item_scraped(self, item, spider):
print "##Item scraped by %s##" %spider.name


@kmike (Member), Nov 23, 2015:

An easier way to process items is to use item pipelines. I think it is better to show the usage of another signal; spider_closed is a good candidate.


@darshanime (Author), Nov 23, 2015:

Should I add spider_closed alongside the current item_scraped, or remove item_scraped?

@redapple redapple added the docs label Jan 26, 2016
@ghost commented May 8, 2016

Any reason why this PR wasn't merged? I was looking for ways to contribute and found this. There isn't an example for custom signals in the docs either.

@redapple (Contributor) commented May 10, 2016

@mgachhui, looks like this conversation stalled after @darshanime's question.

I would comment on the question by agreeing with @kmike that spider_closed is a better suited and more common example for using signals in spider classes. So I would remove item_scraped signal handler from the example.

I had to cook up a similar example the other day on IRC. Having an example in the docs of how to properly and idiomatically hook up a spider_closed handler makes a lot of sense.

@darshanime (Author) commented May 10, 2016

I will submit an update commit soon, thanks for the review!


@darshanime darshanime force-pushed the darshanime:signal-example branch 2 times, most recently from 3cea35b to 10477a6 Jul 24, 2016
@darshanime (Author) commented Jul 24, 2016

@redapple can we merge this?



def spider_closed(self, spider):
print "##Spider closed: %s##" %spider.name


@redapple (Contributor), Jul 25, 2016:

Could you modify this print statement to use logging instead, e.g. spider.logger.info?
Also, a print statement like this is not Python 3 compatible.

@darshanime darshanime force-pushed the darshanime:signal-example branch from 10477a6 to 6e390e3 Jul 25, 2016
@darshanime (Author) commented Jul 25, 2016

@redapple, kindly check now!

@redapple (Contributor) commented Jul 25, 2016

LGTM

@redapple redapple changed the title Include example for signal docs [MRG+1] Include example for signal docs Jul 25, 2016
::

from scrapy import signals


@eliasdorneles (Member), Jul 25, 2016:

Hey, @darshanime -- this looks good, only one small thing to adjust before merging: it would be nice to add from scrapy import Spider here and a pass statement to the parse method below, so that users don't get an error when trying to run the example as-is (a quick way to check is to paste the code into a file and run it with scrapy runspider file.py).

Ready to merge after that. :)


@darshanime (Author), Jul 25, 2016:

Nice idea! Fixed in the amended commit.

@darshanime darshanime force-pushed the darshanime:signal-example branch from 6e390e3 to a2e6452 Jul 25, 2016
@eliasdorneles (Member) commented Jul 25, 2016

Thanks @darshanime !

@eliasdorneles eliasdorneles merged commit b6375d3 into scrapy:master Jul 25, 2016
2 checks passed
codecov/patch: Coverage not affected when comparing ec1c615...a2e6452
continuous-integration/travis-ci/pr: The Travis CI build passed
@darshanime darshanime deleted the darshanime:signal-example branch Jul 26, 2016