
Merge branch 'singleton_removal'

Pablo Hoffman committed Aug 28, 2012
2 parents c95f717 + babfc6e commit da234af45640b62049dc9036d979530792ffc049
Showing with 714 additions and 624 deletions.
  1. +1 −2 docs/experimental/djangoitems.rst
  2. +4 −2 docs/index.rst
  3. +4 −0 docs/news.rst
  4. +333 −0 docs/topics/api.rst
  5. +2 −1 docs/topics/architecture.rst
  6. +1 −1 docs/topics/email.rst
  7. +7 −3 docs/topics/exporters.rst
  8. +51 −83 docs/topics/extensions.rst
  9. +2 −2 docs/topics/firebug.rst
  10. +21 −31 docs/topics/item-pipeline.rst
  11. +4 −2 docs/topics/request-response.rst
  12. +21 −157 docs/topics/settings.rst
  13. +2 −5 docs/topics/signals.rst
  14. +18 −113 docs/topics/stats.rst
  15. +8 −8 docs/topics/telnetconsole.rst
  16. +4 −5 scrapy/contrib/closespider.py
  17. +17 −14 scrapy/contrib/corestats.py
  18. +9 −9 scrapy/contrib/downloadermiddleware/httpcache.py
  19. +2 −4 scrapy/contrib/downloadermiddleware/robotstxt.py
  20. +12 −10 scrapy/contrib/downloadermiddleware/stats.py
  21. +8 −4 scrapy/contrib/feedexport.py
  22. +14 −7 scrapy/contrib/logstats.py
  23. +9 −10 scrapy/contrib/memdebug.py
  24. +10 −12 scrapy/contrib/memusage.py
  25. +2 −9 scrapy/contrib/pipeline/images.py
  26. +3 −10 scrapy/contrib/spidermiddleware/depth.py
  27. +7 −3 scrapy/contrib/spidermiddleware/offsite.py
  28. +2 −3 scrapy/contrib/spiderstate.py
  29. +9 −10 scrapy/contrib/statsmailer.py
  30. +2 −3 scrapy/contrib/throttle.py
  31. +1 −2 scrapy/contrib/webservice/stats.py
  32. +2 −2 scrapy/core/downloader/__init__.py
  33. +12 −12 scrapy/core/engine.py
  34. +9 −6 scrapy/core/scheduler.py
  35. +6 −7 scrapy/core/scraper.py
  36. +4 −2 scrapy/crawler.py
  37. +4 −3 scrapy/mail.py
  38. +0 −1 scrapy/settings/default_settings.py
  39. +27 −0 scrapy/signalmanager.py
  40. +3 −3 scrapy/spidermanager.py
  41. +6 −8 scrapy/stats.py
  42. +8 −12 scrapy/statscol.py
  43. +4 −7 scrapy/telnet.py
  44. +7 −6 scrapy/tests/test_downloadermiddleware_httpcache.py
  45. +8 −7 scrapy/tests/test_downloadermiddleware_stats.py
  46. +3 −3 scrapy/tests/test_engine.py
  47. +6 −6 scrapy/tests/test_mail.py
  48. +2 −1 scrapy/tests/test_spidermiddleware_depth.py
  49. +11 −10 scrapy/tests/test_stats.py
  50. +2 −3 scrapy/webservice.py
@@ -77,8 +77,7 @@ As said before, we can add other fields to the item::
p['age'] = '22'
p['sex'] = 'M'

.. note:: fields added to the item won't be taken into account when doing a
:meth:`~DjangoItem.save`
.. note:: fields added to the item won't be taken into account when doing a :meth:`~DjangoItem.save`

And we can override the fields of the model with your own::

@@ -176,6 +176,7 @@ Extending Scrapy
topics/downloader-middleware
topics/spider-middleware
topics/extensions
topics/api

:doc:`topics/architecture`
Understand the Scrapy architecture.
@@ -187,9 +188,10 @@ Extending Scrapy
Customize the input and output of your spiders.

:doc:`topics/extensions`
Add any custom functionality using :doc:`signals <topics/signals>` and the
Scrapy API
Extend Scrapy with your custom functionality

:doc:`topics/api`
Use it on extensions and middlewares to extend Scrapy functionality

Reference
=========
@@ -6,6 +6,9 @@ Release notes

Scrapy changes:

- dropped Signals singleton. Signals should now be accessed through the Crawler.signals attribute. See the signals documentation for more info.
- dropped Stats Collector singleton. Stats can now be accessed through the Crawler.stats attribute. See the stats collection documentation for more info.
- documented :ref:`topics-api`
- `lxml` is now the default selectors backend instead of `libxml2`
- ported FormRequest.from_response() to use `lxml`_ instead of `ClientForm`_
- removed modules: ``scrapy.xlib.BeautifulSoup`` and ``scrapy.xlib.ClientForm``
@@ -22,6 +25,7 @@ Scrapy changes:
- removed per-spider settings (to be replaced by instantiating multiple crawler objects)
- ``USER_AGENT`` spider attribute will no longer work, use ``user_agent`` attribute instead
- ``DOWNLOAD_TIMEOUT`` spider attribute will no longer work, use ``download_timeout`` attribute instead
- removed ``ENCODING_ALIASES`` setting, as encoding auto-detection has been moved to the `w3lib`_ library

Scrapyd changes:

@@ -0,0 +1,333 @@
.. _topics-api:

========
Core API
========

.. versionadded:: 0.15

This section documents the Scrapy core API, and it's intended for developers of
extensions and middlewares.

.. _topics-api-crawler:

Crawler API
===========

The main entry point to Scrapy API is the :class:`~scrapy.crawler.Crawler`
object, passed to extensions through the ``from_crawler`` class method. This
object provides access to all Scrapy core components, and it's the only way for
extensions to access them and hook their functionality into Scrapy.
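
For illustration, a minimal extension sketch (the class name and its behaviour
are hypothetical) that receives the crawler through ``from_crawler`` and keeps
a reference to it::

    class MyExtension(object):

        def __init__(self, crawler):
            # keep the crawler around to reach settings, signals and stats later
            self.crawler = crawler

        @classmethod
        def from_crawler(cls, crawler):
            # Scrapy calls this class method, passing the running Crawler
            return cls(crawler)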

.. module:: scrapy.crawler
:synopsis: The Scrapy crawler

The Extension Manager is responsible for loading and keeping track of installed
extensions, and it's configured through the :setting:`EXTENSIONS` setting, which
contains a dictionary of all available extensions and their order, similar to
how you :ref:`configure the downloader middlewares
<topics-downloader-middleware-setting>`.

.. class:: Crawler(settings)

The Crawler object must be instantiated with a
:class:`scrapy.settings.Settings` object.

.. attribute:: settings

The settings manager of this crawler.

This is used by extensions & middlewares to access the Scrapy settings
of this crawler.

For an introduction on Scrapy settings see :ref:`topics-settings`.

For the API see :class:`~scrapy.settings.Settings` class.

.. attribute:: signals

The signals manager of this crawler.

This is used by extensions & middlewares to hook themselves into Scrapy
functionality.

For an introduction on signals see :ref:`topics-signals`.

For the API see :class:`~scrapy.signalmanager.SignalManager` class.

.. attribute:: stats

The stats collector of this crawler.

This is used by extensions & middlewares to record stats of their
behaviour, or to access stats collected by other extensions.

For an introduction on stats collection see :ref:`topics-stats`.

For the API see :class:`~scrapy.statscol.StatsCollector` class.

.. attribute:: extensions

The extension manager that keeps track of enabled extensions.

Most extensions won't need to access this attribute.

For an introduction on extensions and a list of available extensions in
Scrapy see :ref:`topics-extensions`.

.. attribute:: spiders

The spider manager which takes care of loading and instantiating
spiders.

Most extensions won't need to access this attribute.

.. attribute:: engine

The execution engine, which coordinates the core crawling logic
between the scheduler, downloader and spiders.

Some extensions may want to access the Scrapy engine, to inspect or modify
the downloader and scheduler behaviour, although this is an advanced use and
this API is not yet stable.

.. method:: configure()

Configure the crawler.

This loads extensions, middlewares and spiders, leaving the crawler
ready to be started. It also configures the execution engine.

.. method:: start()

Start the crawler. This calls :meth:`configure` if it hasn't been called yet.
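
As a rough sketch of driving a crawler manually (assuming ``settings`` is an
already populated :class:`scrapy.settings.Settings` object; Scrapy commands
normally take care of this wiring)::

    from scrapy.crawler import Crawler

    crawler = Crawler(settings)
    crawler.configure()  # load extensions, middlewares and spiders
    crawler.start()      # would call configure() itself if it hadn't run yet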

Settings API
============

.. module:: scrapy.settings
:synopsis: Settings manager

.. class:: Settings()

This object provides access to Scrapy settings.

.. attribute:: overrides

Global overrides are the ones that take the highest precedence, and are usually
populated by command-line options.

Overrides should be populated *before* configuring the Crawler object
(through the :meth:`~scrapy.crawler.Crawler.configure` method),
otherwise they won't have any effect. You don't typically need to worry
about overrides unless you are implementing your own Scrapy command.

.. method:: get(name, default=None)

Get a setting value without affecting its original type.

:param name: the setting name
:type name: string

:param default: the value to return if no setting is found
:type default: any

.. method:: getbool(name, default=False)

Get a setting value as a boolean. For example, both ``1`` and ``'1'``, and
``True`` return ``True``, while ``0``, ``'0'``, ``False`` and ``None``
return ``False``.

For example, settings populated through environment variables set to ``'0'``
will return ``False`` when using this method.

:param name: the setting name
:type name: string

:param default: the value to return if no setting is found
:type default: any

.. method:: getint(name, default=0)

Get a setting value as an int.

:param name: the setting name
:type name: string

:param default: the value to return if no setting is found
:type default: any

.. method:: getfloat(name, default=0.0)

Get a setting value as a float.

:param name: the setting name
:type name: string

:param default: the value to return if no setting is found
:type default: any

.. method:: getlist(name, default=None)

Get a setting value as a list. If the setting's original type is a list, it
will be returned verbatim. If it's a string, it will be split by ",".

For example, settings populated through environment variables set to
``'one,two'`` will return a list ``['one', 'two']`` when using this method.

:param name: the setting name
:type name: string

:param default: the value to return if no setting is found
:type default: any
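
As a sketch of how an extension might read settings (assuming a ``crawler``
reference obtained through ``from_crawler``; the ``MYEXT_*`` setting names are
made up for this example)::

    settings = crawler.settings

    if settings.getbool('MYEXT_ENABLED', False):      # '0'/'1', True/False, ...
        timeout = settings.getfloat('MYEXT_TIMEOUT', 180.0)
        limit = settings.getint('MYEXT_ITEM_LIMIT', 100)
        hosts = settings.getlist('MYEXT_HOSTS')       # 'a,b' becomes ['a', 'b']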

.. _topics-api-signals:

Signals API
===========

.. module:: scrapy.signalmanager
:synopsis: The signal manager

.. class:: SignalManager

.. method:: connect(receiver, signal)

Connect a receiver function to a signal.

The signal can be any object, although Scrapy comes with some
predefined signals that are documented in the :ref:`topics-signals`
section.

:param receiver: the function to be connected
:type receiver: callable

:param signal: the signal to connect to
:type signal: object

.. method:: send_catch_log(signal, \*\*kwargs)

Send a signal, catch exceptions and log them.

The keyword arguments are passed to the signal handlers (connected
through the :meth:`connect` method).

.. method:: send_catch_log_deferred(signal, \*\*kwargs)

Like :meth:`send_catch_log` but supports returning `deferreds`_ from
signal handlers.

Returns a `deferred`_ that gets fired once all the deferreds returned by the
signal handlers have fired.

The keyword arguments are passed to the signal handlers (connected
through the :meth:`connect` method).

.. method:: disconnect(receiver, signal)

Disconnect a receiver function from a signal. This has the opposite
effect of the :meth:`connect` method, and the arguments are the same.

.. method:: disconnect_all(signal)

Disconnect all receivers from the given signal.

:param signal: the signal to disconnect from
:type signal: object
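
For example, an extension could connect a receiver to the ``spider_closed``
signal and send its own custom signal like this (a sketch, assuming a
``crawler`` reference obtained through ``from_crawler``; ``spider_closed`` is
one of the signals documented in :ref:`topics-signals`)::

    from scrapy import signals

    def spider_closed_handler(spider):
        # called when any spider of this crawler is closed
        pass

    crawler.signals.connect(spider_closed_handler, signal=signals.spider_closed)

    # a custom signal can be any object; keyword arguments reach the handlers
    my_signal = object()
    crawler.signals.send_catch_log(my_signal, reason='example')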

.. _topics-api-stats:

Stats Collector API
===================

There are several Stats Collectors available under the
:mod:`scrapy.statscol` module and they all implement the Stats
Collector API defined by the :class:`~scrapy.statscol.StatsCollector`
class (which they all inherit from).

.. module:: scrapy.statscol
:synopsis: Stats Collectors

.. class:: StatsCollector

.. method:: get_value(key, default=None, spider=None)

Return the value for the given stats key or ``default`` if it doesn't exist.
If spider is ``None`` the global stats table is consulted, otherwise the
spider-specific one is. If the spider is not yet opened, a ``KeyError``
exception is raised.

.. method:: get_stats(spider=None)

Get all stats from the given spider (if spider is given) or all global
stats otherwise, as a dict. If the spider is not opened, a ``KeyError`` is
raised.

.. method:: set_value(key, value, spider=None)

Set the given value for the given stats key on the global stats (if
spider is not given) or the spider-specific stats (if spider is given),
which must be opened or a ``KeyError`` will be raised.

.. method:: set_stats(stats, spider=None)

Set the given stats (as a dict) for the given spider. If the spider is
not opened a ``KeyError`` will be raised.

.. method:: inc_value(key, count=1, start=0, spider=None)

Increment the value of the given stats key by the given count, assuming the
given start value if the key is not set yet. If spider is not given the
global stats table is used, otherwise the spider-specific stats table is
used, which must be opened or a ``KeyError`` will be raised.

.. method:: max_value(key, value, spider=None)

Set the given value for the given stats key only if the current value for
the same key is lower than ``value``. If there is no current value for the
given key, the value is always set. If spider is not given, the global stats
table is used, otherwise the spider-specific stats table is used, which must
be opened or a ``KeyError`` will be raised.

.. method:: min_value(key, value, spider=None)

Set the given value for the given stats key only if the current value for
the same key is greater than ``value``. If there is no current value for the
given key, the value is always set. If spider is not given, the global stats
table is used, otherwise the spider-specific stats table is used, which must
be opened or a ``KeyError`` will be raised.

.. method:: clear_stats(spider=None)

Clear all global stats (if spider is not given) or all spider-specific
stats if spider is given, in which case it must be opened or a
``KeyError`` will be raised.

.. method:: iter_spider_stats()

Return an iterator over ``(spider, spider_stats)`` for each open spider
currently tracked by the stats collector, where ``spider_stats`` is the
dict containing all spider-specific stats.

Global stats are not included in the iterator. If you want to get those,
use the :meth:`get_stats` method.
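
A sketch of how an extension or middleware could use this API through
``crawler.stats`` (the ``myext/...`` keys are made up, and ``spider`` and
``depth`` are assumed to come from the surrounding callback or signal
handler)::

    from time import time

    stats = crawler.stats
    stats.set_value('myext/start_time', time(), spider=spider)
    stats.inc_value('myext/pages_seen', spider=spider)
    stats.max_value('myext/max_depth', depth, spider=spider)
    pages = stats.get_value('myext/pages_seen', 0, spider=spider)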

The following methods are not part of the stats collection API; they are
used when implementing custom stats collectors:

.. method:: open_spider(spider)

Open the given spider for stats collection.

.. method:: close_spider(spider)

Close the given spider. After this is called, no more specific stats
for this spider can be accessed.

.. method:: engine_stopped()

Called after the engine is stopped, to dump or persist global stats.
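
For instance, a minimal custom collector sketch that only overrides
:meth:`engine_stopped` (how it gets enabled, typically through the
``STATS_CLASS`` setting, is left out of this sketch)::

    from scrapy.statscol import StatsCollector

    class FileDumpStatsCollector(StatsCollector):
        """Sketch: persist the global stats to a file when the engine stops."""

        def engine_stopped(self):
            stats = self.get_stats()  # the global stats table, as a dict
            with open('stats.dump', 'w') as f:
                f.write(repr(stats))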

.. _deferreds: http://twistedmatrix.com/documents/current/core/howto/defer.html
.. _deferred: http://twistedmatrix.com/documents/current/core/howto/defer.html
@@ -80,7 +80,8 @@ functionality by plugging custom code. For more information see
Data flow
=========

The data flow in Scrapy is controlled by the Engine, and goes like this:
The data flow in Scrapy is controlled by the execution engine, and goes like
this:

1. The Engine opens a domain, locates the Spider that handles that domain, and
asks the spider for the first URLs to crawl.
