Merge pull request #816 from Curita/api-cleanup
GSoC API cleanup
dangra committed Sep 2, 2014
2 parents 620dbe7 + 6339864 commit ccde331
Showing 57 changed files with 922 additions and 573 deletions.
2 changes: 1 addition & 1 deletion conftest.py
@@ -4,7 +4,7 @@

from scrapy import optional_features

collect_ignore = ["scrapy/stats.py"]
collect_ignore = ["scrapy/stats.py", "scrapy/project.py"]
if 'django' not in optional_features:
collect_ignore.append("tests/test_djangoitem/models.py")

31 changes: 0 additions & 31 deletions docs/faq.rst
@@ -280,37 +280,6 @@ I'm scraping an XML document and my XPath selector doesn't return any items

You may need to remove namespaces. See :ref:`removing-namespaces`.


I'm getting an error: "cannot import name crawler"
--------------------------------------------------

This is caused by Scrapy changes resulting from the removal of singletons. The
error is most likely raised by a module (extension, middleware, pipeline or
spider) in your Scrapy project that imports ``crawler`` from
``scrapy.project``. For example::

    from scrapy.project import crawler

    class SomeExtension(object):
        def __init__(self):
            self.crawler = crawler
            # ...

This way of accessing the crawler object is deprecated; the code should be
ported to use the ``from_crawler`` class method, for example::

    class SomeExtension(object):

        @classmethod
        def from_crawler(cls, crawler):
            o = cls()
            o.crawler = crawler
            return o

The Scrapy command line tool has some backwards compatibility in place to
support the old import mechanism (with a deprecation warning), but this
mechanism may not work if you use Scrapy differently (for example, as a
library).

.. _user agents: http://en.wikipedia.org/wiki/User_agent
.. _LIFO: http://en.wikipedia.org/wiki/LIFO
.. _DFO order: http://en.wikipedia.org/wiki/Depth-first_search
160 changes: 144 additions & 16 deletions docs/topics/api.rst
@@ -28,9 +28,10 @@ contains a dictionary of all available extensions and their order similar to
how you :ref:`configure the downloader middlewares
<topics-downloader-middleware-setting>`.

.. class:: Crawler(settings)
.. class:: Crawler(spidercls, settings)

The Crawler object must be instantiated with a
:class:`scrapy.spider.Spider` subclass and a
:class:`scrapy.settings.Settings` object.

.. attribute:: settings
@@ -75,13 +76,6 @@ how you :ref:`configure the downloader middlewares
For an introduction on extensions and a list of available extensions on
Scrapy see :ref:`topics-extensions`.

.. attribute:: spiders

The spider manager which takes care of loading and instantiating
spiders.

Most extensions won't need to access this attribute.

.. attribute:: engine

The execution engine, which coordinates the core crawling logic
@@ -91,18 +85,67 @@ how you :ref:`configure the downloader middlewares
or modify the downloader and scheduler behaviour, although this is an
advanced use and this API is not yet stable.

.. method:: configure()
.. attribute:: spider

Configure the crawler.
Spider currently being crawled. This is an instance of the spider class
provided while constructing the crawler, and it is created with the
arguments given in the :meth:`crawl` method.

This loads extensions, middlewares and spiders, leaving the crawler
ready to be started. It also configures the execution engine.
.. method:: crawl(\*args, \**kwargs)

.. method:: start()
Starts the crawler by instantiating its spider class with the given
`args` and `kwargs` arguments, while setting the execution engine in
motion.

Start the crawler. This calls :meth:`configure` if it hasn't been called yet.
Returns a deferred that is fired when the crawl is finished.
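
For illustration, a minimal sketch of driving a :class:`Crawler` directly
(``MySpider`` is a hypothetical project spider; a running Twisted reactor is
assumed, which the :class:`CrawlerRunner` below takes care of)::

    from scrapy.crawler import Crawler
    from scrapy.settings import Settings

    from myproject.spiders import MySpider  # hypothetical project spider

    crawler = Crawler(MySpider, Settings())
    # crawl() instantiates MySpider(category='shoes'), starts the engine and
    # returns a deferred that is fired when the crawl finishes; afterwards
    # crawler.spider holds the spider instance.
    d = crawler.crawl(category='shoes')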

.. class:: CrawlerRunner(settings)

This is a convenient helper class that creates, configures and runs
crawlers inside an already set up Twisted `reactor`_.

The CrawlerRunner object must be instantiated with a
:class:`~scrapy.settings.Settings` object.

This class shouldn't be needed (since Scrapy is responsible for using it
accordingly) unless you are writing scripts that manually handle the
crawling process. See :ref:`run-from-script` for an example.

.. attribute:: crawlers

Set of :class:`crawlers <scrapy.crawler.Crawler>` created by the
:meth:`crawl` method.

.. attribute:: crawl_deferreds

Set of the `deferreds`_ returned by the :meth:`crawl` method. This
collection is useful for keeping track of the current crawling state.

.. method:: crawl(spidercls, \*args, \**kwargs)

This method sets up the crawling of the given `spidercls` with the
provided arguments.

It takes care of loading the spider class while configuring and starting
a crawler for it.

Returns a deferred that is fired when the crawl is finished.

:param spidercls: spider class or spider's name inside the project
:type spidercls: :class:`~scrapy.spider.Spider` subclass or str

:param args: arguments to initialize the spider
:type args: list

:param kwargs: keyword arguments to initialize the spider
:type kwargs: dict

.. method:: stop()

Simultaneously stops all the crawling jobs in progress.

Returns a deferred that is fired when they all have ended.
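
For example, a short script along these lines (a sketch; ``MySpider`` is a
hypothetical spider class) runs a single crawl and stops the reactor when it
ends::

    from twisted.internet import reactor

    from scrapy.crawler import CrawlerRunner
    from scrapy.settings import Settings

    from myproject.spiders import MySpider  # hypothetical project spider

    runner = CrawlerRunner(Settings())
    d = runner.crawl(MySpider, category='shoes')  # deferred fired when done
    d.addBoth(lambda _: reactor.stop())           # shut down after the crawl
    reactor.run()                                 # blocks until reactor.stop()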

.. _topics-api-settings:

Settings API
@@ -252,8 +295,8 @@ Settings API

.. method:: getlist(name, default=None)

Get a setting value as a list. If the setting original type is a list it
will be returned verbatim. If it's a string it will be split by ",".
Get a setting value as a list. If the setting's original type is a list, a
copy of it will be returned. If it's a string it will be split by ",".

For example, settings populated through environment variables set to
``'one,two'`` will return a list ['one', 'two'] when using this method.
@@ -264,6 +307,90 @@
:param default: the value to return if no setting is found
:type default: any
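
A quick sketch of this behaviour (``MY_LIST`` is a made-up setting name, and
:class:`Settings` is assumed to accept an initial values dict)::

    from scrapy.settings import Settings

    settings = Settings({'MY_LIST': 'one,two'})  # string value, e.g. from an env var
    settings.getlist('MY_LIST')                  # ['one', 'two'] -- split by ","
    settings.getlist('MISSING', default=['x'])   # ['x'] -- setting not found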

.. method:: getdict(name, default=None)

Get a setting value as a dictionary. If the setting's original type is a
dictionary, a copy of it will be returned. If it's a string it will be
evaluated as a JSON dictionary.

:param name: the setting name
:type name: string

:param default: the value to return if no setting is found
:type default: any
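
A sketch of the same idea for dictionaries (``MY_DICT`` is a made-up setting
name)::

    from scrapy.settings import Settings

    settings = Settings({'MY_DICT': '{"key": "value"}'})  # JSON string value
    settings.getdict('MY_DICT')              # {'key': 'value'} -- parsed as JSON
    settings.getdict('MISSING', default={})  # {} -- setting not found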

.. method:: copy()

Make a deep copy of the current settings.

This method returns a new instance of the :class:`Settings` class,
populated with the same values and their priorities.

Modifications to the new object won't be reflected on the original
settings.

.. method:: freeze()

Disable further changes to the current settings.

After calling this method, the present state of the settings will become
immutable. Trying to change values through the :meth:`~set` method and
its variants won't be possible, and will raise an error.

.. method:: frozencopy()

Return an immutable copy of the current settings.

Alias for a :meth:`~freeze` call on the object returned by :meth:`copy`.
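
A sketch of how freezing behaves (the exact exception raised on modification
is an implementation detail; a ``TypeError`` is assumed here)::

    from scrapy.settings import Settings

    settings = Settings({'BOT_NAME': 'mybot'})
    frozen = settings.frozencopy()            # deep, immutable copy

    settings.set('BOT_NAME', 'otherbot')      # the original stays mutable
    try:
        frozen.set('BOT_NAME', 'otherbot')    # rejected: settings are frozen
    except TypeError:                         # assumed error type
        pass

    frozen.get('BOT_NAME')                    # still 'mybot'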

.. _topics-api-spidermanager:

SpiderManager API
=================

.. module:: scrapy.spidermanager
:synopsis: The spider manager

.. class:: SpiderManager

This class is in charge of retrieving and handling the spider classes
defined across the project.

Custom spider managers can be employed by specifying their path in the
:setting:`SPIDER_MANAGER_CLASS` project setting. They must fully implement
the :class:`scrapy.interfaces.ISpiderManager` interface to guarantee
error-free execution.

.. method:: from_settings(settings)

This class method is used by Scrapy to create an instance of the class.
It's called with the current project settings, and it loads the spiders
found in the modules of the :setting:`SPIDER_MODULES` setting.

:param settings: project settings
:type settings: :class:`~scrapy.settings.Settings` instance

.. method:: load(spider_name)

Get the Spider class with the given name. It'll look into the previously
loaded spiders for a spider class with name `spider_name` and will raise
a KeyError if not found.

:param spider_name: spider class name
:type spider_name: str

.. method:: list()

Get the names of the available spiders in the project.

.. method:: find_by_request(request)

List the names of the spiders that can handle the given request. It will
try to match the request's URL against the spiders' domains.

:param request: queried request
:type request: :class:`~scrapy.http.Request` instance
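
Putting these methods together, a sketch (``myproject.spiders`` and the
spider name ``'example'`` are hypothetical)::

    from scrapy.settings import Settings
    from scrapy.spidermanager import SpiderManager

    settings = Settings({'SPIDER_MODULES': ['myproject.spiders']})
    manager = SpiderManager.from_settings(settings)

    manager.list()                 # e.g. ['example'] -- loaded spider names
    cls = manager.load('example')  # the Spider subclass; KeyError if unknown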

.. _topics-api-signals:

Signals API
@@ -384,3 +511,4 @@ class (which they all inherit from).

.. _deferreds: http://twistedmatrix.com/documents/current/core/howto/defer.html
.. _deferred: http://twistedmatrix.com/documents/current/core/howto/defer.html
.. _reactor: http://twistedmatrix.com/documents/current/core/howto/reactor-basics.html
13 changes: 10 additions & 3 deletions docs/topics/logging.rst
@@ -10,7 +10,11 @@ logging`_ but this may change in the future.

.. _Twisted logging: http://twistedmatrix.com/projects/core/documentation/howto/logging.html

The logging service must be explicitly started through the :func:`scrapy.log.start` function.
The logging service must be explicitly started through the
:func:`scrapy.log.start` function to catch top-level Scrapy log messages.
In addition, each crawler has its own independent log observer
(automatically attached when the crawler is created) that intercepts its
spider's log messages.

.. _topics-logging-levels:

@@ -55,8 +59,11 @@ scrapy.log module

.. function:: start(logfile=None, loglevel=None, logstdout=None)

Start the logging facility. This must be called before actually logging any
messages. Otherwise, messages logged before this call will get lost.
Start the top-level Scrapy logger. This must be called before actually
logging any top-level messages (those logged using this module's
:func:`~scrapy.log.msg` function instead of the :meth:`Spider.log
<scrapy.spider.Spider.log>` method). Otherwise, messages logged before this
call will get lost.

:param logfile: the file path to use for logging output. If omitted, the
:setting:`LOG_FILE` setting will be used. If both are ``None``, the log
