
GSoC API cleanup #816

Merged
merged 18 commits into scrapy:master from curita:api-cleanup on Sep 2, 2014

Conversation

@curita (Member) commented Jul 24, 2014

API cleanup task of the 2014 GSoC.

Tasks Status

  • Settings
    • Freeze and copy support
    • Document
    • Test
  • Spider
    • Add from_crawler method
    • Deprecate set_crawler
    • Document
    • Test
  • Crawler
    • Instantiate with a required spider class (along with the settings)
    • Crawl method will create a spider object with the provided arguments
    • Merge the configure method with crawl and init
    • Merge the start method with crawl
    • Redefine CrawlerProcess API
    • Support creating crawlers in an already set up ioloop
    • Allow simultaneous execution of crawlers
    • Drop support for scrapy.project.crawler singleton
    • Document
    • Test
  • SpiderManager
    • Instantiate with a settings instance
    • Add load method to return a spider class given its name
    • Document
    • Test
  • Integration
    • Update cmdline with the new API
    • Update commands
    • Update tests
    • Update misc components

Obsolete features

  • SpiderManager.create: Spiders are now instantiated through their from_crawler method, which is implemented in the base spider, so create is no longer needed.
  • Crawler.install, Crawler.uninstall, scrapy.project and scrapy.stats: Their support needed to be dropped to allow running multiple parallel crawling jobs.
  • Crawler.start: Merged with Crawler.crawl, and its previous functionality is not easily portable to the new API.
  • Crawler._start_requests: This attribute doesn't need to be set, since now it's created and used in the same method.
  • CrawlerProcess has been substantially modified, but since it's not a public class, backward compatibility wasn't provided.

Changed features

  • SpiderManager.__init__(spider_modules) -> SpiderManager.__init__(settings): SpiderManager objects are now created from a settings instance rather than a list of modules, to better abstract custom implementations.
  • Crawler.__init__(settings) -> Crawler.__init__(spidercls, settings): Each crawler is bound to a spider class upon instantiation. The reasons for this design decision are described in the GSoC 2014 Proposal.
  • Crawler.crawl(spider, requests=None) -> Crawler.crawl(*args, **kwargs): Crawler.crawl and Crawler.start have been merged, so Crawler.crawl now returns a deferred. The arguments provided are used to initialize the crawler's spider class. This design decision is also explained in the previously cited document. A sketch of how these pieces fit together follows this list.
  • scrapy.shell.inspect_response(response, spider=None) -> scrapy.shell.inspect_response(response, spider): Since we don't have the crawler singleton anymore, we have to reach the crawler explicitly. The easiest way to implement this was to use the crawler attribute of the spider argument, so the latter is now required.
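
To make the new surface concrete, here is a rough sketch of how the pieces described above fit together. It is illustrative only: the followall spider and get_project_settings assume the testspiders project used in the docs example, and exact import paths may differ.

from twisted.internet import reactor

from scrapy.crawler import Crawler
from scrapy.spidermanager import SpiderManager
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
spiders = SpiderManager.from_settings(settings)  # built from a settings instance, not spider_modules
spidercls = spiders.load('followall')            # load() returns the spider class for a given name

crawler = Crawler(spidercls, settings)           # each crawler is bound to a spider class
d = crawler.crawl(domain='scrapinghub.com')      # args/kwargs initialize the spider; returns a Deferred
d.addBoth(lambda _: reactor.stop())
reactor.run()
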
@dangra (Member) commented Jul 24, 2014

A few comments were raised by @kmike about whether it is possible to use Scrapy as a library after these changes, and they directly touch this point: "Merge CrawlerProcess with Crawler".

The idea is that after these changes users can import Scrapy as a library, instantiate Crawlers and run them on an already set up ioloop. It means the reactor is already up and is managed by the user's code, not by the Crawler instance.

This is useful, for example, in a web service that starts Crawlers on demand, along with other things, within the same Python interpreter.
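
A rough sketch of that usage (not code from this PR; CrawlerRunner, get_project_settings and the testspiders project are assumed): the user's code owns the reactor and simply schedules crawls on it when needed.

from twisted.internet import reactor

from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())

def crawl_on_demand(domain):
    # Called from user code (e.g. a web service handler) while the reactor,
    # owned by that code, is already running; Scrapy never calls reactor.run() itself.
    return runner.crawl('followall', domain=domain)

reactor.callWhenRunning(crawl_on_demand, 'scrapinghub.com')  # the application decides when to crawl
reactor.run()  # and the application owns the reactor lifecycle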

@curita curita changed the title from [WIP] GSoC API cleanup to GSoC API cleanup on Aug 13, 2014
@curita curita mentioned this pull request Aug 13, 2014
@curita (Member, Author) commented Aug 13, 2014

I've updated the list of changed and obsolete features after these commits, and submitted the #854 pull request on top of this one. It has diverged slightly from the original proposal but I think the changes make sense and came out naturally based on their implementation.

My only concern is the modifications I had to make to the logging system. Basically, I needed to support simultaneous LogObservers listening to different logging channels (twisted's system argument in their log events). I've made the fewest possible adjustments to accomplish this functionality, but I think that switching to the logging facility in the Python standard library is the way to go, since we'd get built-in separate channels via different loggers, and the observers map almost directly to the logging handlers concept.
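
For illustration, the channel idea boils down to something like this minimal Twisted-only sketch (not the actual code in this PR): each observer only reacts to events emitted on its own system channel.

from twisted.python import log

def channel_observer(channel):
    # Build an observer that only handles events logged on one channel.
    def observer(event):
        if event.get('system') == channel:
            print('[%s] %s' % (channel, log.textFromEventDict(event)))
    return observer

log.addObserver(channel_observer('spider-a'))
log.addObserver(channel_observer('spider-b'))
log.msg('hello', system='spider-a')  # only the 'spider-a' observer picks this up

With the standard library the same separation comes for free: logging.getLogger('spider-a') and logging.getLogger('spider-b') are already distinct channels, and handlers play the role of observers.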

Obsolete features listed above weren't given backward compatibility mostly because their specific support would greatly complicate the API, and this is a cleanup after all. Still, I can look into restoring support if any of the obsolete features is needed.

Despite these considerations, the pull request is pretty much finished.

@@ -77,8 +78,7 @@ how you :ref:`configure the downloader middlewares

.. attribute:: spiders

-    The spider manager which takes care of loading and instantiating
-    spiders.
+    The spider manager which takes care of loading spiders.

@kmike (Member) commented Aug 13, 2014

I don't quite get why Crawler should know about spider managers. Crawler takes settings and a spider class and then runs a spider when asked to do so, right? Why does it matter for Crawler where the Spider class came from? Why can't it be used without a spider manager?

Are there extensions that make use of this attribute?

@dangra (Member) commented Aug 13, 2014

I see it is kept for backwards compatibility; +1 to undocument it. More comments on the scrapy/crawler.py file below.

# knows where to look at)
d = runner.crawl('followall', domain='scrapinghub.com')
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished

@kmike (Member) commented Aug 13, 2014

It'd be very nice to provide an example of how to use Crawler and Spider without a Scrapy project.

@dangra (Member) commented Aug 13, 2014

👍 to @kmike's comment.
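
A possible shape for such an example (a self-contained sketch; it assumes crawl() accepts a spider class directly in addition to a name, and the spider and URL are made up):

import scrapy
from twisted.internet import reactor

from scrapy.crawler import CrawlerRunner
from scrapy.settings import Settings

class TitleSpider(scrapy.Spider):
    # A throwaway spider defined inline: no Scrapy project, no SPIDER_MODULES.
    name = 'title'
    start_urls = ['http://example.com']

    def parse(self, response):
        yield {'title': response.xpath('//title/text()').extract()}

runner = CrawlerRunner(Settings())
d = runner.crawl(TitleSpider)        # pass the class itself instead of a spider name
d.addBoth(lambda _: reactor.stop())
reactor.run()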

def crawl():
    for domain in ['scrapinghub.com', 'insophia.com']:
        yield runner.crawl('followall', domain=domain)
    reactor.stop()

@kmike (Member) commented Aug 13, 2014

👍

@kmike (Member) commented Aug 13, 2014

I think that the next step could be yield runner.crawl(..) returning something more useful, e.g. an iterator over scraped items, something similar to the example in tornado docs: http://www.tornadoweb.org/en/stable/guide/coroutines.html#looping

@dangra (Member) commented Aug 13, 2014

It would be great, but it is out of scope. Personally I think we need a helper for this: a function that connects to the item_scraped signal and wraps it in an iterator. Probably better to move the discussion to its own issue :)
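
Something along those lines, perhaps (a hypothetical helper, not part of this PR, built on the item_scraped signal and the new Crawler API):

from scrapy import signals
from scrapy.crawler import Crawler

def crawl_and_collect(spidercls, settings, *args, **kwargs):
    # Run a single crawl and fire the returned Deferred with the scraped items.
    crawler = Crawler(spidercls, settings)
    items = []

    def on_item_scraped(item, response, spider):
        items.append(item)

    crawler.signals.connect(on_item_scraped, signal=signals.item_scraped)
    d = crawler.crawl(*args, **kwargs)
    return d.addCallback(lambda _: items)

Turning that into a real iterator (items delivered as they are scraped rather than when the crawl ends) would need a queue between the signal handler and the consumer, which is probably what a dedicated issue should flesh out.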

@kmike (Member) commented Aug 13, 2014

Yes, a helper is a good idea.

I've just read the code and left some comments; I didn't mean they should all be resolved as a part of this PR.

crawler.engine = crawler._create_engine()
crawler.engine.start()

self.crawler_process._start_logging()

@kmike (Member) commented Aug 13, 2014

Why is a crawler started differently here (not like in docs, using some private methods)? Can we make it work using only the public interface? If not, what do you think about adding a comment explaining why this is needed?


def set(self, name, value, priority='project'):
    assert not self.frozen, "Trying to modify an immutable Settings object"

@kmike (Member) commented Aug 13, 2014

I think it is better to raise some "real" exception (like TypeError), because AssertionError doesn't describe well what happens, and because assert can be optimized out if Python is executed with the -O command-line option.
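
In other words, something like this (a sketch of the suggestion, not the final code; the exact exception class is settled in the discussion below):

def set(self, name, value, priority='project'):
    if self.frozen:
        # A real exception survives python -O, unlike assert, and reads better in tracebacks.
        raise TypeError("Trying to modify an immutable Settings object")
    # ... rest of set() unchanged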

self._start_requests = lambda: requests
# Attribute kept for backward compatibility (Use CrawlerRunner.spiders)
spman_cls = load_object(self.settings['SPIDER_MANAGER_CLASS'])
self.spiders = spman_cls.from_settings(self.settings)

@dangra (Member) commented Aug 13, 2014

I am not sure we want to keep backward compatibility, but in case we do, let's implement it as a property showing a warning when used for the first time.
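
For instance (a sketch of the warn-on-first-use idea; the attribute names come from the snippet above):

import warnings

from scrapy.exceptions import ScrapyDeprecationWarning
from scrapy.utils.misc import load_object

class Crawler(object):
    # (rest of the class as in the PR)

    @property
    def spiders(self):
        # Built lazily; the deprecation warning fires only on first access.
        if not hasattr(self, '_spiders'):
            warnings.warn("Crawler.spiders is deprecated, use CrawlerRunner.spiders instead",
                          category=ScrapyDeprecationWarning, stacklevel=2)
            spman_cls = load_object(self.settings['SPIDER_MANAGER_CLASS'])
            self._spiders = spman_cls.from_settings(self.settings)
        return self._spiders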

@curita (Member, Author) commented Aug 14, 2014

@kmike @dangra I've implemented your suggestions.

@dangra (Member) commented on tests/test_settings/__init__.py in 70f2010, Aug 14, 2014

Sorry for being late, but I think RuntimeError is better than TypeError in this case; the latter is a much more common mistake. @kmike what do you think?

@kmike (Member) replied Aug 14, 2014

I suggested TypeError because this is what immutable containers raise, e.g.

In [1]: (3,2)[0] = 1
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-cab3fa073914> in <module>()
----> 1 (3,2)[0] = 1

TypeError: 'tuple' object does not support item assignment


dangra added a commit that referenced this pull request Sep 2, 2014
GSoC API cleanup
@dangra dangra merged commit ccde331 into scrapy:master Sep 2, 2014
1 check passed: continuous-integration/travis-ci (The Travis CI build passed)
@curita curita deleted the curita:api-cleanup branch Sep 6, 2014
@nyov (Contributor) commented Sep 23, 2014

Great work and big kudos! Also for taking the time to group commits by functionality, which makes it easy to review the upcoming changes.

Though am I correct that merging this should somehow be reflected in the SEP19 document (marking it as final/implemented)?

Thanks to all involved in making this happen, I hope it's been a blast.

@pablohoffman (Member) commented Oct 2, 2014

@nyov damn right - 993b543

@raintan commented on docs/topics/practices.rst in 3547ca6, Nov 14, 2015

This comment should be removed or changed to: # 'testspider' is the name...

@curita (Member, Author) replied Nov 16, 2015

Not sure if I'm misunderstanding what you're saying @raintan, but testspiders is the name of the project (which you can find here: https://github.com/scrapinghub/testspiders; it should be linked in the docs as well), and followall is the name of the spider inside that project that we're using (you can check the code in the line below).

@raintan replied Nov 16, 2015

In the latest version of the docs, followall isn't part of the code example anymore.
see: http://doc.scrapy.org/en/1.0/topics/practices.html right after "What follows is a working example of how to do that, using the testspiders project as example."
(I'm new here, so I'm not sure if this was the proper place to put this comment.)

@curita (Member, Author) replied Nov 16, 2015

Oh, I see what you meant, thanks for the notice!

This issue seems to be fixed already (that's why these docs are correct), but the latest available version of the Scrapy docs is this one: http://doc.scrapy.org/en/master/topics/practices.html (I know, it's rather confusing that it's not /latest). We'll see about backporting the fix to 1.0 accordingly.

Generally, opening an issue is the way to go for any kind of bug; it's more flexible and GitHub offers more features for them (they can be referenced more easily, their status can be modified to track the work done on them, milestones can be assigned to each of them, etc.), so feel free to open an issue for your next bug report :)
