GSoC API cleanup #816

Merged
merged 18 commits into scrapy:master from curita:api-cleanup on Sep 2, 2014

Conversation

@curita
Member

curita commented Jul 24, 2014

API cleanup task of the 2014 GSoC.

Tasks Status

  • Settings
    • Freeze and copy support
    • Document
    • Test
  • Spider
    • Add from_crawler method
    • Deprecate set_crawler
    • Document
    • Test
  • Crawler
    • Instantiate with a required spider class (alongside the settings)
    • Crawl method will create a spider object with the provided arguments
    • Merge the configure method with crawl and init
    • Merge the start method with crawl
    • Redefine CrawlerProcess API
    • Support creating crawlers in an already set-up ioloop
    • Allow simultaneous execution of crawlers
    • Drop support for scrapy.project.crawler singleton
    • Document
    • Test
  • SpiderManager
    • Instantiate with a settings instance
    • Add load method to return a spider class given its name
    • Document
    • Test
  • Integration
    • Update cmdline with the new API
    • Update commands
    • Update tests
    • Update misc components

Obsolete features

  • SpiderManager.create: spiders are now instantiated with their from_crawler method, and since that method is implemented in the base spider, create is no longer needed (see the sketch after this list).
  • Crawler.install, Crawler.uninstall, scrapy.project and scrapy.stats: their support needed to be dropped to allow running multiple parallel crawling jobs.
  • Crawler.start: merged with Crawler.crawl; its previous functionality is not easily portable to the new API.
  • Crawler._start_requests: this attribute no longer needs to be set, since the requests are now created and used in the same method.
  • CrawlerProcess has been substantially modified, but since it's not a public class, backward compatibility wasn't provided.
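For illustration, a minimal sketch of the from_crawler path that replaces SpiderManager.create; MySpider and build_spider are made-up names, and only the from_crawler classmethod on the base spider mentioned above is assumed:

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'

    def build_spider(crawler, *args, **kwargs):
        # instead of SpiderManager.create(name, ...), the spider class itself
        # is instantiated through the from_crawler classmethod on the base Spider
        return MySpider.from_crawler(crawler, *args, **kwargs)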

Changed features

  • SpiderManager.__init__(spider_modules) -> SpiderManager.__init__(settings): SpiderManager objects are now created from a settings instance rather than a list of modules, which better abstracts custom implementations.
  • Crawler.__init__(settings) -> Crawler.__init__(spidercls, settings): each crawler is bound to a spider class upon instantiation. The reasons for this design decision are described in the GSoC 2014 proposal.
  • Crawler.crawl(spider, requests=None) -> Crawler.crawl(*args, **kwargs): Crawler.crawl and Crawler.start have been merged, so Crawler.crawl now returns a deferred. The arguments provided are used to instantiate the crawler's spider class (see the sketch after this list). This design decision is also explained in the previously cited document.
  • scrapy.shell.inspect_response(response, spider=None) -> scrapy.shell.inspect_response(response, spider): since we don't have the crawler singleton anymore, we have to state the crawler explicitly. The easiest way to implement this was to use the crawler attribute of the spider argument, so the latter is now required.
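A hedged sketch of the changed call sites listed above; the spider name, module path and setting value are placeholders, and the import paths are assumptions for this era of the code:

    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
    from scrapy.spidermanager import SpiderManager

    settings = Settings()
    settings.set('SPIDER_MODULES', ['myproject.spiders'])  # placeholder project module

    # SpiderManager is now built from a settings instance and exposes load()
    spiders = SpiderManager(settings)        # was: SpiderManager(spider_modules)
    spidercls = spiders.load('followall')    # returns the spider class for that name

    # Crawler is now bound to a spider class; crawl() merges the old start()
    crawler = Crawler(spidercls, settings)   # was: Crawler(settings)
    d = crawler.crawl(domain='scrapinghub.com')  # args/kwargs build the spider; returns a Deferred

    # scrapy.shell.inspect_response(response, spider) now requires the spider,
    # since the crawler is reached through it instead of a global singleton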
@dangra
Member

dangra commented Jul 24, 2014

@kmike raised a few comments about whether it is possible to use Scrapy as a library after these changes, and it directly touches the point "Merge CrawlerProcess with Crawler".

The idea is that after these changes users can import Scrapy as a library, instantiate Crawlers and run them in an already set-up ioloop. That means the reactor is already up and is managed by the user's code, not by the Crawler instance.

This is useful, for example, in a web service that starts Crawlers on demand alongside other things within the same Python interpreter.
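For example, a minimal sketch of that library-style usage under the new API; MySpider is a hypothetical spider class, and the surrounding application, not the crawler, owns the reactor:

    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
    from myproject.spiders import MySpider  # hypothetical spider class

    def start_crawl():
        # the reactor is already running and managed by the user's code;
        # the crawler neither starts nor stops it
        crawler = Crawler(MySpider, Settings())
        return crawler.crawl()  # Deferred that fires when this crawl finishes

    reactor.callWhenRunning(start_crawl)
    reactor.run()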

@curita curita changed the title from [WIP] GSoC API cleanup to GSoC API cleanup Aug 13, 2014

@curita curita referenced this pull request Aug 13, 2014

Closed

GSoC Per-spider settings #854

@curita
Member

curita commented Aug 13, 2014

I've updated the list of changed and obsolete features after these commits, and submitted the #854 pull request on top of this one. It has diverged slightly from the original proposal, but I think the changes make sense and came out naturally from their implementation.

My only concern is the modifications I had to make to the logging system. Basically, I needed to support simultaneous LogObservers listening to different logging channels (twisted's system argument in their log events). I've made the fewest possible adjustments to accomplish this, but I think that switching to the logging facility in the Python standard library is the way to go, since we'd get built-in separate channels given by different loggers, and the observers map almost directly onto the logging handlers concept.

The obsolete features listed above weren't given backward compatibility, mostly because supporting them would greatly complicate the API, and this is a cleanup after all. Still, I can look into restoring support if any of the obsolete features is needed.

Despite these considerations, the pull request is pretty much finished.
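To make that direction concrete, a rough sketch of the stdlib-logging idea (this is not what the PR implements; the channel name and handler setup are made up):

    import logging

    # each crawl gets its own channel (logger); what twisted calls a LogObserver
    # listening to a given system maps roughly onto a handler attached to it
    channel = logging.getLogger('scrapy.followall')   # hypothetical per-spider channel
    handler = logging.FileHandler('followall.log')
    handler.setFormatter(logging.Formatter('%(asctime)s [%(name)s] %(levelname)s: %(message)s'))
    channel.addHandler(handler)
    channel.setLevel(logging.DEBUG)
    channel.propagate = False  # keep records on this channel from also reaching the root logger

    channel.info('records on this channel reach only its own handlers')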

docs/topics/api.rst (outdated review thread)
docs/topics/practices.rst (outdated review thread):
def crawl():
    for domain in ['scrapinghub.com', 'insophia.com']:
        yield runner.crawl('followall', domain=domain)
    reactor.stop()
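For context, a hedged sketch of how a snippet like the one above would be driven end to end; the CrawlerRunner/get_project_settings surroundings are assumptions based on this PR's direction, not the exact docs text:

    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner        # assumed runner class from this work
    from scrapy.utils.project import get_project_settings

    runner = CrawlerRunner(get_project_settings())

    @defer.inlineCallbacks
    def crawl():
        for domain in ['scrapinghub.com', 'insophia.com']:
            yield runner.crawl('followall', domain=domain)  # each crawl completes before the next starts
        reactor.stop()

    crawl()
    reactor.run()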

@kmike

kmike Aug 13, 2014

Member

👍

@kmike

kmike Aug 13, 2014

Member

I think that the next step could be yield runner.crawl(..) returning something more useful, e.g. an iterator over scraped items, something similar to the example in tornado docs: http://www.tornadoweb.org/en/stable/guide/coroutines.html#looping

@dangra

dangra Aug 13, 2014

Member

It would be great, but it is out of scope. Personally I think we need a helper for this, a function that connects item_scraped signal and wrap it in an iterator. Probably better to move the discussion to its own issue :)

@kmike

kmike Aug 13, 2014

Member

Yes, a helper is a good idea.

I've just read the code and left some comments; I didn't mean they should be all resolved as a part of this PR.
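For reference, a rough sketch of the kind of helper being discussed, assuming only the crawler's signal manager and the existing item_scraped signal (collect_items is a made-up name and not part of this PR):

    from scrapy import signals

    def collect_items(crawler):
        # connect to item_scraped and gather items into a plain list that fills
        # up while the crawl runs; an iterator-based wrapper could build on this
        items = []

        def on_item_scraped(item, response, spider):
            items.append(item)

        crawler.signals.connect(on_item_scraped, signal=signals.item_scraped)
        return items

    # usage: items = collect_items(crawler); yield crawler.crawl();
    # items then holds everything scraped by that run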

scrapy/commands/shell.py (outdated review thread)
scrapy/settings/__init__.py (outdated review thread)
scrapy/crawler.py (outdated review thread)
@curita
Member

curita commented Aug 14, 2014

@kmike @dangra I've implemented your suggestions.

@dangra
dangra commented on tests/test_settings/__init__.py in 70f2010 Aug 14, 2014

Sorry for being late, but I think RuntimeError is better than TypeError in this case; the latter is a much more common mistake. @kmike what do you think?

@kmike
kmike replied Aug 14, 2014

I suggested TypeError because this is what immutable containers raise, e.g.

In [1]: (3,2)[0] = 1
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-cab3fa073914> in <module>()
----> 1 (3,2)[0] = 1

TypeError: 'tuple' object does not support item assignment
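For comparison, a hedged sketch of the same behaviour with the frozen settings from this PR (set and freeze follow the task list above; whether the exception ends up as TypeError or RuntimeError is exactly what is being debated here):

    from scrapy.settings import Settings

    settings = Settings()
    settings.set('BOT_NAME', 'example')
    settings.freeze()                    # freeze support added in this PR
    settings.set('BOT_NAME', 'other')    # expected to raise once the settings are frozen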

dangra replied Aug 14, 2014

dangra added a commit that referenced this pull request Sep 2, 2014

@dangra dangra merged commit ccde331 into scrapy:master Sep 2, 2014

1 check passed

continuous-integration/travis-ci: The Travis CI build passed

@curita curita deleted the curita:api-cleanup branch Sep 6, 2014

@nyov
Contributor

nyov commented Sep 23, 2014

Great work and big kudos! Also, thanks for taking the time to group commits by functionality; it makes it easy to review these upcoming changes.

Though am I correct that having this merged should somehow be reflected in the SEP-19 document (marking it as final/implemented)?

Thanks to all involved in making this happen, I hope it's been a blast.

@pablohoffman
Member

pablohoffman commented Oct 2, 2014

@nyov damn right - 993b543

@raintan
raintan commented on docs/topics/practices.rst in 3547ca6 Nov 14, 2015

This comment should be removed or changed to: # 'testspider' is the name...

@curita
Member

curita replied Nov 16, 2015

Not sure if I'm misunderstanding what you're saying, @raintan, but testspiders is the name of the project (which you can find here: https://github.com/scrapinghub/testspiders, and it should be linked in the docs as well), and followall is the name of the spider inside that project that we're using (you can check the code in the line below).

@raintan
raintan replied Nov 16, 2015

In the latest version of the docs, followall isn't part of the code example anymore.
see: http://doc.scrapy.org/en/1.0/topics/practices.html right after "What follows is a working example of how to do that, using the testspiders project as example."
(I'm new here, so I'm not sure if this was the proper place to put this comment.)

@curita
Member

curita replied Nov 16, 2015

Oh, I see what you meant, thanks for the notice!

This issue seems to be fixed already (that's why these docs are correct), but the latest available version of the Scrapy docs is this one: http://doc.scrapy.org/en/master/topics/practices.html (I know, it's rather confusing that it's not /latest). We'll look into backporting the fix to 1.0 accordingly.

Generally, opening an issue is the way to go for any kind of bug; it's more flexible, and GitHub offers more features for issues (they can be referenced more easily, we can update their status to track the work done on them, milestones can be assigned to each of them, etc.), so feel free to open an issue for your next bug report :)
