Merge pull request #816 from Curita/api-cleanup
GSoC API cleanup
dangra committed Sep 2, 2014
2 parents 620dbe7 + 6339864 commit ccde331
Showing 57 changed files with 922 additions and 573 deletions.
2 changes: 1 addition & 1 deletion conftest.py
@@ -4,7 +4,7 @@

from scrapy import optional_features

collect_ignore = ["scrapy/stats.py"]
collect_ignore = ["scrapy/stats.py", "scrapy/project.py"]
if 'django' not in optional_features:
collect_ignore.append("tests/test_djangoitem/models.py")

31 changes: 0 additions & 31 deletions docs/faq.rst
@@ -280,37 +280,6 @@ I'm scraping an XML document and my XPath selector doesn't return any items

You may need to remove namespaces. See :ref:`removing-namespaces`.


I'm getting an error: "cannot import name crawler"
--------------------------------------------------

This is caused by Scrapy changes resulting from the removal of singletons. The
error is most likely raised by a module (extension, middleware, pipeline or
spider) in your Scrapy project that imports ``crawler`` from
``scrapy.project``. For example::

    from scrapy.project import crawler

    class SomeExtension(object):
        def __init__(self):
            self.crawler = crawler
            # ...

This way of accessing the crawler object is deprecated; the code should be
ported to use the ``from_crawler`` class method, for example::

    class SomeExtension(object):

        @classmethod
        def from_crawler(cls, crawler):
            o = cls()
            o.crawler = crawler
            return o

The Scrapy command line tool has some backwards compatibility in place to
support the old import mechanism (with a deprecation warning), but this
mechanism may not work if you use Scrapy differently (for example, as a
library).

.. _user agents: http://en.wikipedia.org/wiki/User_agent
.. _LIFO: http://en.wikipedia.org/wiki/LIFO
.. _DFO order: http://en.wikipedia.org/wiki/Depth-first_search
160 changes: 144 additions & 16 deletions docs/topics/api.rst
@@ -28,9 +28,10 @@ contains a dictionary of all available extensions and their order similar to
how you :ref:`configure the downloader middlewares
<topics-downloader-middleware-setting>`.

.. class:: Crawler(settings)
.. class:: Crawler(spidercls, settings)

The Crawler object must be instantiated with a
:class:`scrapy.spider.Spider` subclass and a
:class:`scrapy.settings.Settings` object.

.. attribute:: settings
@@ -75,13 +76,6 @@ how you :ref:`configure the downloader middlewares
For an introduction on extensions and a list of available extensions on
Scrapy see :ref:`topics-extensions`.

.. attribute:: spiders

The spider manager which takes care of loading and instantiating
spiders.

Most extensions won't need to access this attribute.

.. attribute:: engine

The execution engine, which coordinates the core crawling logic
@@ -91,18 +85,67 @@ how you :ref:`configure the downloader middlewares
or modify the downloader and scheduler behaviour, although this is an
advanced use and this API is not yet stable.

.. method:: configure()
.. attribute:: spider

Configure the crawler.
Spider currently being crawled. This is an instance of the spider class
provided while constructing the crawler, and it is created with the
arguments given in the :meth:`crawl` method.

This loads extensions, middlewares and spiders, leaving the crawler
ready to be started. It also configures the execution engine.
.. method:: crawl(\*args, \**kwargs)

.. method:: start()
Starts the crawler by instantiating its spider class with the given
`args` and `kwargs` arguments, while setting the execution engine in
motion.

Start the crawler. This calls :meth:`configure` if it hasn't been called yet.
Returns a deferred that is fired when the crawl is finished.
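
For illustration, a minimal sketch of driving a :class:`Crawler` directly
(``MySpider`` is a hypothetical project spider; a running Twisted reactor is
assumed, which the :class:`CrawlerRunner` below takes care of)::

    from scrapy.crawler import Crawler
    from scrapy.settings import Settings

    from myproject.spiders import MySpider  # hypothetical project spider

    crawler = Crawler(MySpider, Settings())
    # crawl() instantiates MySpider(category='shoes'), starts the engine and
    # returns a deferred that is fired when the crawl finishes; afterwards
    # crawler.spider holds the spider instance.
    d = crawler.crawl(category='shoes')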

.. class:: CrawlerRunner(settings)

This is a convenient helper class that creates, configures and runs
crawlers inside an already set up Twisted `reactor`_.

The CrawlerRunner object must be instantiated with a
:class:`~scrapy.settings.Settings` object.

This class shouldn't be needed (since Scrapy is responsible for using it
accordingly) unless you are writing scripts that manually handle the
crawling process. See :ref:`run-from-script` for an example.

.. attribute:: crawlers

Set of :class:`crawlers <scrapy.crawler.Crawler>` created by the
:meth:`crawl` method.

.. attribute:: crawl_deferreds

Set of the `deferreds`_ returned by the :meth:`crawl` method. This
collection is useful for keeping track of the current crawling state.

.. method:: crawl(spidercls, \*args, \**kwargs)

This method sets up the crawling of the given `spidercls` with the
provided arguments.

It takes care of loading the spider class while configuring and starting
a crawler for it.

Returns a deferred that is fired when the crawl is finished.

:param spidercls: spider class or spider's name inside the project
:type spidercls: :class:`~scrapy.spider.Spider` subclass or str

:param args: arguments to initialize the spider
:type args: list

:param kwargs: keyword arguments to initialize the spider
:type kwargs: dict

.. method:: stop()

Simultaneously stops all the crawling jobs in progress.

Returns a deferred that is fired when they all have ended.
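
For example, a short script along these lines (a sketch; ``MySpider`` is a
hypothetical spider class) runs a single crawl and stops the reactor when it
ends::

    from twisted.internet import reactor

    from scrapy.crawler import CrawlerRunner
    from scrapy.settings import Settings

    from myproject.spiders import MySpider  # hypothetical project spider

    runner = CrawlerRunner(Settings())
    d = runner.crawl(MySpider, category='shoes')  # deferred fired when done
    d.addBoth(lambda _: reactor.stop())           # shut down after the crawl
    reactor.run()                                 # blocks until reactor.stop()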

.. _topics-api-settings:

Settings API
@@ -252,8 +295,8 @@ Settings API

.. method:: getlist(name, default=None)

Get a setting value as a list. If the setting original type is a list it
will be returned verbatim. If it's a string it will be split by ",".
Get a setting value as a list. If the setting's original type is a list, a
copy of it will be returned. If it's a string it will be split by ",".

For example, settings populated through environment variables set to
``'one,two'`` will return a list ['one', 'two'] when using this method.
@@ -264,6 +307,90 @@
:param default: the value to return if no setting is found
:type default: any
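
A quick sketch of this behaviour (``MY_LIST`` is a made-up setting name, and
:class:`Settings` is assumed to accept an initial values dict)::

    from scrapy.settings import Settings

    settings = Settings({'MY_LIST': 'one,two'})  # string value, e.g. from an env var
    settings.getlist('MY_LIST')                  # ['one', 'two'] -- split by ","
    settings.getlist('MISSING', default=['x'])   # ['x'] -- setting not found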

.. method:: getdict(name, default=None)

Get a setting value as a dictionary. If the setting's original type is a
dictionary, a copy of it will be returned. If it's a string it will be
evaluated as a JSON dictionary.

:param name: the setting name
:type name: string

:param default: the value to return if no setting is found
:type default: any
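
A sketch of the same idea for dictionaries (``MY_DICT`` is a made-up setting
name)::

    from scrapy.settings import Settings

    settings = Settings({'MY_DICT': '{"key": "value"}'})  # JSON string value
    settings.getdict('MY_DICT')              # {'key': 'value'} -- parsed as JSON
    settings.getdict('MISSING', default={})  # {} -- setting not found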

.. method:: copy()

Make a deep copy of the current settings.

This method returns a new instance of the :class:`Settings` class,
populated with the same values and their priorities.

Modifications to the new object won't be reflected on the original
settings.

.. method:: freeze()

Disable further changes to the current settings.

After calling this method, the present state of the settings will become
immutable. Trying to change values through the :meth:`~set` method and
its variants won't be possible, and will raise an error.

.. method:: frozencopy()

Return an immutable copy of the current settings.

Alias for a :meth:`~freeze` call on the object returned by :meth:`copy`.
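
A sketch of how freezing behaves (the exact exception raised on modification
is an implementation detail; a ``TypeError`` is assumed here)::

    from scrapy.settings import Settings

    settings = Settings({'BOT_NAME': 'mybot'})
    frozen = settings.frozencopy()            # deep, immutable copy

    settings.set('BOT_NAME', 'otherbot')      # the original stays mutable
    try:
        frozen.set('BOT_NAME', 'otherbot')    # rejected: settings are frozen
    except TypeError:                         # assumed error type
        pass

    frozen.get('BOT_NAME')                    # still 'mybot'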

.. _topics-api-spidermanager:

SpiderManager API
=================

.. module:: scrapy.spidermanager
:synopsis: The spider manager

.. class:: SpiderManager

This class is in charge of retrieving and handling the spider classes
defined across the project.

Custom spider managers can be employed by specifying their path in the
:setting:`SPIDER_MANAGER_CLASS` project setting. They must fully implement
the :class:`scrapy.interfaces.ISpiderManager` interface to guarantee
error-free execution.

.. method:: from_settings(settings)

This class method is used by Scrapy to create an instance of the class.
It's called with the current project settings, and it loads the spiders
found in the modules of the :setting:`SPIDER_MODULES` setting.

:param settings: project settings
:type settings: :class:`~scrapy.settings.Settings` instance

.. method:: load(spider_name)

Get the Spider class with the given name. It'll look into the previously
loaded spiders for a spider class with name `spider_name` and will raise
a KeyError if not found.

:param spider_name: spider class name
:type spider_name: str

.. method:: list()

Get the names of the available spiders in the project.

.. method:: find_by_request(request)

List the names of the spiders that can handle the given request. It will
try to match the request's URL against the spiders' domains.

:param request: queried request
:type request: :class:`~scrapy.http.Request` instance
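
Putting these methods together, a sketch (``myproject.spiders`` and the
spider name ``'example'`` are hypothetical)::

    from scrapy.settings import Settings
    from scrapy.spidermanager import SpiderManager

    settings = Settings({'SPIDER_MODULES': ['myproject.spiders']})
    manager = SpiderManager.from_settings(settings)

    manager.list()                 # e.g. ['example'] -- loaded spider names
    cls = manager.load('example')  # the Spider subclass; KeyError if unknown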

.. _topics-api-signals:

Signals API
@@ -384,3 +511,4 @@ class (which they all inherit from).

.. _deferreds: http://twistedmatrix.com/documents/current/core/howto/defer.html
.. _deferred: http://twistedmatrix.com/documents/current/core/howto/defer.html
.. _reactor: http://twistedmatrix.com/documents/current/core/howto/reactor-basics.html
13 changes: 10 additions & 3 deletions docs/topics/logging.rst
@@ -10,7 +10,11 @@ logging`_ but this may change in the future.

.. _Twisted logging: http://twistedmatrix.com/projects/core/documentation/howto/logging.html

The logging service must be explicitly started through the :func:`scrapy.log.start` function.
The logging service must be explicitly started through the
:func:`scrapy.log.start` function to catch top-level Scrapy log messages.
In addition, each crawler has its own independent log observer
(automatically attached when the crawler is created) that intercepts its
spider's log messages.

.. _topics-logging-levels:

@@ -55,8 +59,11 @@ scrapy.log module

.. function:: start(logfile=None, loglevel=None, logstdout=None)

Start the logging facility. This must be called before actually logging any
messages. Otherwise, messages logged before this call will get lost.
Start the top-level Scrapy logger. This must be called before actually
logging any top-level messages (those logged using this module's
:func:`~scrapy.log.msg` function instead of the :meth:`Spider.log
<scrapy.spider.Spider.log>` method). Otherwise, messages logged before this
call will get lost.

:param logfile: the file path to use for logging output. If omitted, the
:setting:`LOG_FILE` setting will be used. If both are ``None``, the log
